ABSTRACT

In this paper, a hierarchical context definition is added to
an existing clustering algorithm in order to increase its
robustness. The resulting algorithm, which clusters contexts
and events separately, is used to experiment with different
ways of defining the context a language model takes into
account. The contexts range from standard bigram and
trigram contexts to part-of-speech five-grams. Although none
of the models can compete directly with a backoff trigram,
they give up to 9% improvement in perplexity when
interpolated with a trigram. Moreover, the modified version
of the algorithm leads to a performance increase over the
original version of up to 12%.

1. Introduction 

The task of a language model is to calculate $p(w_i | c_i)$,
the probability of the next word being $w_i$ given the current
context $c_i$. Language models differ in the way this
probability is modelled and in how the context $c_i$ is
defined. A quite general model proposed in [?] makes use of a
state mapping function $S$ and a category mapping function
$G$. The idea behind the state mapping $S : c \to s_c = S(c)$
is to assign each of the large number of possible contexts
$c \in C$ to one of a smaller number of context-equivalent
states. Similarly, the category mapping
$G : w \to g_w = G(w)$ assigns each of the large number of
possible words $w \in V$ to one of a smaller number of
categories (similar to parts of speech). The probability of
the next word is then calculated as

$p(w_i | c_i) = p(G(w_i) | S(c_i)) \cdot p(w_i | G(w_i))$.   (1)
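
As an illustration of Equation (1), the following minimal
Python sketch evaluates the two-factor probability on a toy
example. The mappings, the probability tables and the function
name `class_model_prob` are hypothetical and chosen only for
this example; they are not taken from the experiments.

```python
def class_model_prob(word, context, G, S, class_given_state, word_given_class):
    """Evaluate p(w | c) = p(G(w) | S(c)) * p(w | G(w)), as in Equation (1)."""
    g = G[word]        # category of the word
    s = S[context]     # state of the context
    return class_given_state[(g, s)] * word_given_class[(word, g)]


# Hypothetical toy mappings and probability tables (illustration only).
G = {"cat": "NOUN", "dog": "NOUN", "runs": "VERB"}
S = {("the",): "s_det", ("a",): "s_det"}
class_given_state = {("NOUN", "s_det"): 0.9, ("VERB", "s_det"): 0.1}
word_given_class = {
    ("cat", "NOUN"): 0.5,
    ("dog", "NOUN"): 0.5,
    ("runs", "VERB"): 1.0,
}

p_cat = class_model_prob("cat", ("a",), G, S, class_given_state, word_given_class)
total = sum(
    class_model_prob(w, ("a",), G, S, class_given_state, word_given_class)
    for w in G
)
```

Because both factors are normalised in the toy tables, the
probabilities over the whole vocabulary sum to one for a given
context.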

In [?], a heuristic version of a clustering algorithm was
presented, which can be used to calculate $S$ and $G$
automatically. In this paper, the algorithm is extended to
deal with a hierarchy of contexts, which increases its
robustness (Section 2). It is then used to experiment with
different ways of defining the context, including the use of
part-of-speech information. The different models are
evaluated in terms of perplexity on the Wall Street Journal
corpus (Section 4).



2. Clustering Algorithm 

The initial clustering algorithm used to determine $S$ and
$G$ automatically is shown in Figure 1. It is a greedy,
hill-climbing algorithm that moves elements to the best
available choice at any given time. For more details about the
algorithm, the optimisation criterion and its heuristic version
(which is used in all the experiments reported here), please
refer to [?].

Algorithm 1: Clustering()

start with initial clustering functions S, G
iterate until some convergence criterion is met
    for all w in V and c in C
        for all g'_w in G and s'_c in S
            calculate the difference in the optimisation
            criterion when w/c is moved from g_w/s_c to
            g'_w/s'_c
        move the w/c to the g'_w/s'_c that results in the
        biggest improvement in the optimisation criterion
End Clustering

Figure 1: The clustering algorithm
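
The greedy loop of Algorithm 1 can be sketched in Python as
follows. This is only a sketch under strong assumptions: it
clusters a single set of elements rather than words and
contexts jointly, it recomputes the objective from scratch for
every candidate move, and the objective used in the toy run
(negative within-cluster squared error) is a stand-in, not the
optimisation criterion of the paper. All names are
hypothetical.

```python
def greedy_cluster(elements, clusters, assignment, objective, max_iters=10):
    """Greedy hill climbing: move each element to its best cluster in turn."""
    assignment = dict(assignment)
    for _ in range(max_iters):
        moved = False
        for e in elements:
            original = assignment[e]
            best_cluster, best_score = original, objective(assignment)
            for k in clusters:
                assignment[e] = k              # tentatively move e to k
                score = objective(assignment)
                if score > best_score:
                    best_cluster, best_score = k, score
            assignment[e] = best_cluster       # commit the best move
            moved = moved or best_cluster != original
        if not moved:                          # no element changed cluster
            break
    return assignment


# Stand-in objective (assumption): negative within-cluster squared error.
values = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 5.1}

def neg_squared_error(assignment):
    total = 0.0
    for k in set(assignment.values()):
        members = [values[e] for e in assignment if assignment[e] == k]
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return -total

start = {"a": 0, "b": 1, "c": 0, "d": 1}
result = greedy_cluster(list(values), [0, 1], start, neg_squared_error)
```

In the toy run the two small values end up in one cluster and
the two large values in the other, regardless of which cluster
label each pair receives.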

A major drawback of the algorithm becomes apparent when
it is used for wider contexts. Since $S$ clusters individual
contexts, many of these contexts have occurred only
infrequently in the training data. It is therefore very
difficult to assign them to a meaningful cluster. In fact, the
algorithm does not attempt to move elements which have
occurred fewer than a minimal number of times (the empirically
determined value of 6 was used for this threshold in our
experiments). Depending on the number of elements for which
this is true, this can lead to poor performance. In the
trigram case, for example, 85% of the distinct contexts seen
during training have occurred fewer than 6 times.
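
The proportion of rarely seen contexts can be measured with a
few lines of Python. The sketch below is illustrative only:
the function name and the toy token sequence are hypothetical,
and the threshold is lowered to 2 so that the tiny example
produces a non-trivial result.

```python
from collections import Counter


def rare_context_fraction(tokens, order=3, min_count=6):
    """Fraction of distinct (order-1)-word contexts seen fewer than min_count times."""
    width = order - 1
    contexts = Counter(
        tuple(tokens[i - width:i]) for i in range(width, len(tokens))
    )
    if not contexts:
        return 0.0
    rare = sum(1 for n in contexts.values() if n < min_count)
    return rare / len(contexts)


# Toy data (hypothetical): the contexts are the pairs of preceding words.
tokens = "a b a b a b a b c d".split()
fraction = rare_context_fraction(tokens, order=3, min_count=2)
```

Here three distinct contexts occur, (a, b), (b, a) and (b, c),
and only the last one is seen fewer than two times.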

The main idea to improve upon this situation is as follows.
Rather than moving individual contexts, the algorithm first
moves groups of contexts together. Each group will have
occurred more frequently and hence its statistics will be more
reliable. Only later on is the algorithm allowed to move
individual contexts. As an example consider the trigram
case, where the context is defined by the pair of previous
words $(w_{i-2}, w_{i-1})$. Initially, the algorithm moves all
contexts which have the same $w_{i-1}$ together (i.e.
identical bigram contexts). Subsequently, it proceeds by
moving pairs of words.

Figure 2: Example of a trigram tree (leaf labels of the form
$w_{i-2} = w_1, \ldots, w_{i-2} = w_m$)

In more general terms, we can represent the groupings of the
contexts in terms of a tree $T$. The leaves of $T$ correspond
to all the different contexts seen during training. The nodes
at each of the levels $0 < l < L - 1$ of the tree correspond
to a classification of all the contexts into a smaller set of
groups. For example, the tree shown in Figure 2 corresponds to
the trigram case, where contexts with the same $w_{i-1}$ are
grouped together.

It is quite simple to modify the clustering algorithm to make
use of such a tree $T$. Let $N(T, l)$ denote the set of nodes
of $T$ at level $l$ and let $Contexts(n)$ denote the set of
contexts below a node $n$ (i.e. all the leaves dominated by
$n$). The resulting clustering algorithm is shown in Figure 3.

Although a tree can be used to represent many different ways
of grouping contexts, we have so far only experimented with
very simple trees. Let each context $c$ be defined by an
$L$-tuple of values $c = (v_1, \ldots, v_L)$. The trees used
in our experiments always group contexts which have identical
sub-contexts together. Thus, the $i$-th level of the tree has
one node for each existing $i$-tuple of values
$(v_1, \ldots, v_i)$ and each such node contains all the
contexts which are further refinements of this $i$-tuple.
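
The grouping of contexts by identical sub-contexts can be
sketched in Python as below. The sketch assumes that each
context is stored as a tuple $(v_1, \ldots, v_L)$ whose first
component is the value used for the coarsest grouping (in the
trigram case, $v_1 = w_{i-1}$ and $v_2 = w_{i-2}$). The
function name and the level numbering from 1 to $L$ are
choices made for this example only.

```python
from collections import defaultdict


def tree_levels(contexts):
    """Map each level i to {i-tuple prefix: list of contexts refining it}."""
    contexts = list(contexts)
    depth = len(contexts[0])
    levels = {}
    for i in range(1, depth + 1):
        nodes = defaultdict(list)
        for c in contexts:
            nodes[c[:i]].append(c)   # node = shared prefix (v_1, ..., v_i)
        levels[i] = dict(nodes)
    return levels


# Hypothetical trigram contexts, stored as (w_{i-1}, w_{i-2}).
contexts = [("b", "a"), ("b", "c"), ("d", "a")]
levels = tree_levels(contexts)
```

At level 1 the two contexts sharing $w_{i-1} = b$ fall under
one node and can be moved together; at level 2 every context
is its own node, which corresponds to moving individual
contexts.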

3. Test Corpus and Clustering Times 

Using the non-verbalised version of the Wall Street Journal
corpus (approximately 38 million words, 20,000 word
vocabulary), different language models were evaluated in terms
of perplexity. We use the same conditions as [?] and [?] in
order to make direct comparisons of perplexity possible.
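
For reference, perplexity is the exponential of the average
negative log probability that a model assigns to the words of
the test text. A minimal Python sketch, with a hypothetical
function name and toy probabilities, is:

```python
import math


def perplexity(probs):
    """exp of the mean negative log probability of the test words."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# A model that gives every test word probability 1/4.
pp = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that assigns probability 1/4 to every word has a
perplexity of 4, i.e. it is as uncertain as a uniform choice
among four words.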

Algorithm 2: Clustering()

start with initial clustering functions S, G
for each level l of tree T
    iterate until some convergence criterion is met
        for all w in V and n in N(T, l)
            for all g'_w in G and s'_n in S
                calculate the difference in the optimisation
                criterion when w is moved from g_w to g'_w
                or when all c in Contexts(n) are moved from
                s_n to s'_n
            move the w to the g'_w or all c in Contexts(n)
            to the s'_n that results in the biggest
            improvement in the optimisation criterion
End Clustering

Figure 3: The tree-based clustering algorithm

Table 1: Comparison of the clustering algorithm and standard
n-grams

All of the results described in this paper were obtained
without putting much effort into individual parameter tuning.
The threshold below which elements are not moved by the
algorithm was set to 6 in all experiments. As the convergence
criterion, the relative improvement in the value of the
optimisation function during the last iteration was used. If
that improvement is less than 1%, no more iterations are
performed. This results in only two iterations in most cases,
which is significantly fewer than the 20 to 30 iterations
mentioned in [?]. Hence there is reason to believe that some
of the results could be improved upon by better optimisation
of these parameters.
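
The stopping rule described above can be sketched in Python as
follows. The sketch assumes that the optimisation function is
maximised and non-zero, and it treats one clustering iteration
as an opaque callable `step` returning the new value; the
function name and the toy values are hypothetical.

```python
def run_until_converged(step, initial_value, min_rel_improvement=0.01, max_iters=100):
    """Iterate until the relative improvement of the last iteration is below the threshold."""
    value = initial_value
    iterations = 0
    while iterations < max_iters:
        new_value = step()
        iterations += 1
        rel = (new_value - value) / abs(value)   # relative improvement
        value = new_value
        if rel < min_rel_improvement:
            break
    return value, iterations


# Hypothetical objective values produced by successive iterations.
trace = iter([-900.0, -895.0, -894.0])
final_value, n_iters = run_until_converged(lambda: next(trace), -1000.0)
```

With these toy values the first iteration improves the
objective by 10% and the second by less than 1%, so the loop
stops after two iterations.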

Using the heuristic version of the algorithm presented in [?],
one iteration in the bigram case takes about 5 hours (elapsed
time, not CPU time) on a DEC Alpha workstation. The complete
clustering takes about 10 hours for a bigram and 3 days for
a trigram. The time required for most of the other models
lies in between the bigram and trigram case.

4. Results 

In a first set of experiments, the clustering algorithms were
compared to the standard bigram and trigram models. The
results are shown in Table 1. First, one can compare our
backoff results to those reported in [2]. Our backoff bigram
result is about 2% worse and our trigram result about 7%
worse. The difference could be explained by the different
smoothing technique we use and by the fact that our trigram
discards singleton events. Second, one can see that the
backoff models outperform the clustered models. This is
especially true for the trigram. It is worth noting, however,
that the trigram has approximately 14 million parameters, as
compared to 4 million for the clustered model. Third, Table 1
also shows that the tree-based version of the algorithm
outperforms the original one, giving an improvement of 12%.
Finally, the clustered bigram result using 500 clusters allows
a direct comparison with the one given in [?], where a very
similar perplexity figure of 244 is given.

Table 2: Using 61 linguistic parts of speech tags t

Table 3: Using 1000 clustering classes g

Table 4: Interpolation with the backoff trigram

In a second set of experiments, the use of part-of-speech
information in the context definition was investigated. Due
to limitations of our software, each word could belong to one
part of speech only. Brill's rule-based tagger was therefore
employed to assign the most likely tag $t$ to each word in
the official 20K vocabulary used in the language modeling
experiments. This resulted in 61 different tags. Table 2
gives the results for various models using this part-of-speech
information. As the size of the context window increases,
the tree-based version of the algorithm gives an increasing
gain in performance. When moving from a window size of
three to four, the standard version of the clustering algorithm
does not lead to an improvement (by looking at one extra
digit, one can see that the perplexity actually rises from
304.6 to 305.0). This is presumably because of the data
sparseness problem mentioned in Section 2. The performance of
the tree-based version, however, continues to improve.

In a third set of experiments, the clustering of words $G$
produced by the algorithm was used to define the context.
Compared to using the linguistic parts of speech, this has the
advantage that the number of classes can be determined almost
at will. The perplexities for a model that uses 1000 different
classes are shown in Table 3. One can again see the benefit
of using the tree-based version. Moreover, it is interesting to
note that the resulting perplexity comes quite close to that
of a clustered trigram.



In a final set of experiments, some of the previously
investigated models were interpolated linearly with the backoff
trigram. The results are shown in Table 4. One can see
that the interpolation with the backoff trigram leads to an
improvement of up to 9% over the backoff trigram by itself.

5. Conclusion

An existing clustering algorithm was extended to deal with a
hierarchical definition of contexts. This led to a significant
perplexity improvement of up to 12%. The resulting algorithm
was used to experiment with different ways of defining
the contexts. Although none of the models outperforms a
backoff trigram, they lead to a perplexity improvement of
up to 9% when interpolated with a trigram.



